Prepare the required packages
Load the CSV file. The dataset was created by Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) in 2009.
Below is the statistical summary of the dataset.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
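The listings above are consistent with calling names(), str() and summary() on the loaded data frame. The following is a minimal base R sketch of that workflow, not the original code. The file name of the real CSV is not given here, so the sketch writes a tiny stand-in file (three rows and four columns copied from the str() output) to a temporary path; `csv_path` and `wine` are hypothetical names.

```r
# Hypothetical stand-in for the real CSV: three rows copied from the
# str() output above, restricted to a few columns.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "X,fixed.acidity,alcohol,quality",
  "1,7,8.8,6",
  "2,6.3,9.5,6",
  "3,8.1,10.1,6"
), csv_path)

# Load the file into a data frame.
wine <- read.csv(csv_path)

names(wine)    # column names
str(wine)      # column types and first values
summary(wine)  # min, quartiles, median, mean and max per column
```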
The dataset consists of 13 variables (the row index X, 11 chemical properties and quality), with a total of 4,898 observations.
Most white wines have a quality of 5 or 6. A new variable, quality.factor, is created for further investigation.
The first graph is the original distribution of fixed acidity. The second graph, after removing outliers, shows an approximately normal distribution that peaks at around 7 g/dm^3.
The first graph is the original distribution of volatile acidity. The second graph, after removing outliers, shows an approximately normal distribution that peaks at around 0.25 g/dm^3.
The first graph is the original distribution of citric acid. The second graph, after removing outliers, shows an approximately normal distribution that peaks at around 0.3 g/dm^3.
Fixed acidity, volatile acidity and citric acid all measure acid content, and their distributions look alike.
The first graph shows the original distribution of residual sugar, which is heavily skewed to the right. After transforming the x-axis to a log scale, the second graph shows a bimodal distribution.
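One way to obtain a log-scale view like the one described above is to take log10 of the values before binning. The sketch below uses base R on simulated right-skewed data; the vector `sugar` is made up (it is unimodal, unlike the real residual sugar), and the original plotting code may differ.

```r
set.seed(1)
# Simulated right-skewed values standing in for residual sugar (g/dm^3).
sugar <- rlnorm(1000, meanlog = 1.5, sdlog = 0.8)

# Right skew pulls the mean above the median on the original scale.
mean(sugar) > median(sugar)

# Bin the values on the original scale, then on a log10 scale.
h_raw <- hist(sugar, breaks = 30, plot = FALSE)
h_log <- hist(log10(sugar), breaks = 30, plot = FALSE)

# On the log scale the mean and median nearly coincide.
abs(mean(log10(sugar)) - median(log10(sugar)))
```

Calling plot(h_raw) and plot(h_log) draws the two histograms for a side-by-side comparison.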
The first graph is the original distribution of chlorides. The second graph, after removing outliers, shows an approximately normal distribution. This agrees with the summary of the dataset, where the median (0.043 g/dm^3) is very close to the mean (0.046 g/dm^3).
The first graph is the original distribution of free sulfur dioxide. The second graph, after removing outliers, shows a distribution slightly skewed to the right that peaks at around 30 mg/dm^3.
The first graph is the original distribution of total sulfur dioxide. The second graph, after removing outliers, shows a distribution slightly skewed to the right that peaks at around 110 mg/dm^3.
The distributions of free and total sulfur dioxide are very similar: both are slightly skewed to the right after removing outliers. However, the amount of free sulfur dioxide in all white wines is relatively constant.
The white wines span only a very narrow range of density, approximately from 0.987 to 1.04 g/cm^3.
The first graph is the original distribution of density. The second graph, after removing outliers, shows a distribution slightly skewed to the right that peaks at around 0.992 g/cm^3.
pH shows an approximately normal distribution, with values concentrated between 3.0 and 3.3.
The distribution of sulphates is approximately normal after removing the outliers, and it is quite similar to those of free sulfur dioxide and total sulfur dioxide. The relationship between these three variables will be checked in a later section of this project.
The distribution of alcohol is slightly right skewed, with values concentrated between 9% and 11%.
The data set contains 4,898 white wines with 11 variables that quantify the chemical properties of each wine. The 11 variables are fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates and alcohol. All 11 variables are numeric.
At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent). Quality is recorded as an integer.
The main feature of interest in the dataset is quality. This project explores which chemical properties influence the quality of white wines.
According to the documentation, alcohol, volatile acidity, free sulfur dioxide and total sulfur dioxide seem to affect the quality of white wines the most. Their relationships are explored in the following section.
A new variable, quality.factor, is created. It is an ordered factor, which is more useful for further investigation.
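An ordered factor of this kind can be built with base R's factor(). The sketch below uses a short made-up vector of scores rather than the real column, and the level range 3 to 9 is taken from the minimum and maximum of quality in the summary above.

```r
# Hypothetical quality scores; the real column holds integers from 3 to 9.
quality <- c(6L, 5L, 7L, 3L, 9L, 6L, 8L, 4L)

# Ordered factor: levels keep their numeric order, so comparisons work.
quality.factor <- factor(quality, levels = 3:9, ordered = TRUE)

levels(quality.factor)
table(quality.factor)
quality.factor >= "7"   # TRUE for wines rated 7, 8 or 9
```

Because the factor is ordered, plotting functions keep the quality levels in rank order, and levels with no observations still appear in table().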
There are no missing values in the dataset.
For alcohol, the distribution is slightly right skewed. It is not adjusted because the skewness is not extreme.
For residual sugar, the distribution is heavily right skewed. After a log-scale transformation, it shows a bimodal distribution.
The rest of the variables have many outliers. After removing the outliers, they all show an approximately normal distribution, so no further transformation or adjustment is needed.
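The rule used to remove outliers is not stated. As an illustration only, the sketch below assumes Tukey's 1.5 * IQR rule; `remove_outliers` is a hypothetical helper and the chlorides-like values are made up.

```r
# Keep only values within 1.5 * IQR of the quartiles (Tukey's rule).
# This rule is an assumption; the cut-offs actually used are not stated.
remove_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  x[!is.na(x) & x >= q[1] - fence & x <= q[2] + fence]
}

# Made-up chlorides-like values with two extreme readings.
chlorides <- c(0.036, 0.040, 0.043, 0.045, 0.047, 0.050, 0.052, 0.200, 0.346)
trimmed <- remove_outliers(chlorides)

length(chlorides) - length(trimmed)  # number of values dropped
range(trimmed)
```

A quantile-based trim (for example, dropping the top 1% of values) would be an equally plausible choice; it changes the cut-offs but not the overall approach.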
The correlation table below shows the correlations between all pairs of variables.
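A table of pairwise correlations can be produced with base R's cor(). The sketch below assumes Pearson correlation and runs on simulated columns, so its numbers are unrelated to the figures reported in this analysis; `toy`, `corr` and `q_corr` are hypothetical names.

```r
set.seed(42)
n <- 200
# Simulated columns, not the wine data: density is built to fall as alcohol rises.
alcohol <- runif(n, 8, 14)
density <- 1.02 - 0.002 * alcohol + rnorm(n, sd = 0.001)
quality <- round(3 + 0.4 * alcohol + rnorm(n, sd = 1))
toy <- data.frame(alcohol, density, quality)

# Pearson correlation matrix for all pairs of columns.
corr <- cor(toy, method = "pearson")
round(corr, 2)

# Correlations with quality, sorted by absolute size.
q_corr <- corr["quality", colnames(corr) != "quality"]
q_corr[order(abs(q_corr), decreasing = TRUE)]
```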
Alcohol has the highest correlation with quality (0.44). As shown above, white wines of quality 7, 8 and 9 have a higher median alcohol level, at 11% or above.
Density has the second highest correlation with quality, but in the opposite direction. As shown above, median density tends to decrease as the quality of white wine increases.
Density has the highest correlation with alcohol, but in the opposite direction. The plot above shows a clear negative relationship.
Exploring further, density is positively correlated with both residual sugar and total sulfur dioxide.
Likewise, alcohol is negatively correlated with both residual sugar and total sulfur dioxide.
The scatterplot above shows a positive correlation between free sulfur dioxide and total sulfur dioxide.
In this part, the correlations between quality and the chemical properties are computed: alcohol (0.44), density (-0.31), chlorides (-0.21), volatile.acidity (-0.19), total.sulfur.dioxide (-0.17), fixed.acidity (-0.11), residual.sugar (-0.10) and pH (0.10).
The correlations of quality with citric acid, sulphates and free sulfur dioxide are very small.
The graph of quality vs alcohol shows that white wines can be distinguished by alcohol level. If the alcohol level is approximately 11% or above, the white wine is very likely to have a quality of 7, 8 or 9.
Since alcohol and density both have a high correlation with white wine quality, the correlation between alcohol and density is also explored. It shows a high negative correlation (-0.78) between the two.
Exploring further, both residual sugar and total sulfur dioxide have a high positive correlation with density and a high negative correlation with alcohol. This is consistent with the negative correlation between alcohol and density.
For quality: alcohol (0.44) and density (-0.31).
For non-quality pairs: density vs residual sugar (0.84) and alcohol vs density (-0.78).
In this section, we explore further how to distinguish white wines with quality below 7 by looking at different chemical properties.
In the above graph, a line is drawn at an alcohol level of 10.4%. It makes clear that all white wines of quality 9 have an alcohol level of 10.4% or above, and a large proportion of white wines of quality 8 lie above this line as well.
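The 10.4% line coincides with the overall median of alcohol in the summary table. The share of wines on or above such a line can also be computed per quality level, which puts numbers on what the graph shows. The sketch below uses made-up rows, so its output does not describe the real wines; `wine`, `threshold` and `share_above` are hypothetical names.

```r
# Made-up alcohol levels and quality scores, for illustration only.
wine <- data.frame(
  alcohol = c(9.0, 9.4, 10.0, 10.8, 11.2, 12.0, 12.5, 10.4, 9.8, 12.8),
  quality = c(5L,  5L,  6L,   6L,   7L,   8L,   9L,   9L,   6L,  8L)
)

threshold <- 10.4  # reference line used in the text

# Share of wines at or above the threshold within each quality level.
share_above <- tapply(wine$alcohol >= threshold, wine$quality, mean)
share_above

# Median alcohol per quality level, as a boxplot would show.
tapply(wine$alcohol, wine$quality, median)
```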
The above graph shows that good quality white wines have the lowest levels of chlorides and alcohol levels above 10.4%, whereas wines with higher levels of chlorides and alcohol levels below 10.4% are most likely to be of bad quality.
White wines with a chlorides level above 0.15 g/dm^3 are very likely to have a quality of 5 or 6.
The above graph shows that white wines of good and bad quality (3, 7, 8, 9) usually have volatile acidity below 0.6 g/dm^3. In other words, wines with volatile acidity above 0.6 g/dm^3 are likely to have a quality of 4, 5 or 6. However, there is no clear relationship between fixed acidity or volatile acidity and the quality of white wine.
From the above, white wines with residual sugar above 20 g/dm^3 all have a quality of 5 or 6 in this dataset.
From the above, white wines with a high level of total sulfur dioxide are more likely to be of lower quality.
From the above, white wines with a density above 1.0025 g/cm^3 all have a quality of 6 in this dataset.
From the above, unfortunately there is no clear relationship between pH and quality of white wine.
At the same level of free sulfur dioxide, wines with a higher level of total sulfur dioxide normally have lower quality.
With the above graphs, we try to find out how to distinguish the quality of white wines using chemical properties other than alcohol.
For density, the range is too narrow: we can only identify wines with an extreme density above 1.0025 g/cm^3 as quality 6.
For chlorides, good quality white wines usually have the lowest levels of chlorides, whereas bad quality ones usually have the highest levels.
For residual sugar, only white wines of quality 5 and 6 have residual sugar levels above 20 g/dm^3.
For total sulfur dioxide, wines with higher levels tend to have lower quality.
The “Bivariate Analysis” section established that free sulfur dioxide and total sulfur dioxide are positively correlated. In this part, we further find that, among white wines with the same level of free sulfur dioxide, those with a lower level of total sulfur dioxide tend to have higher quality.
This is the first graph that shows how to distinguish the quality of white wines. The higher the median alcohol level, the higher the quality tends to be. In other words, high quality white wines usually have a higher alcohol percentage.
Good quality white wines tend to have a lower level of chlorides (below 0.05 g/dm^3) and an alcohol level above 10.4%. If the chlorides level is very high (above 0.15 g/dm^3), the quality of the white wine is very likely to be 5 or 6.
Judging by inspection and the correlation table, the level of total sulfur dioxide does not directly contribute to the quality of white wines. However, it is clearly related to the level of alcohol, with a negative correlation (-0.45).
From the documentation, total sulfur dioxide is equal to the sum of free sulfur dioxide and bound sulfur dioxide. The above graph shows that, at the same level of free sulfur dioxide, white wines with a lower level of total sulfur dioxide tend to have higher quality. Since less total sulfur dioxide at a fixed free level means a larger free share, this implies that white wines with a higher ratio of free to total sulfur dioxide usually have higher quality.
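The relationship between free, bound and total sulfur dioxide can be made concrete with a small calculation. The sketch below assumes the ratio is defined as free divided by total, which is an assumption because the exact definition is not stated. The readings are made up, and `bound.sulfur.dioxide` and `free.ratio` are hypothetical column names.

```r
# Made-up readings in mg/dm^3: same free level, different totals.
so2 <- data.frame(
  free.sulfur.dioxide  = c(30, 30, 30),
  total.sulfur.dioxide = c(100, 150, 200)
)

# Bound SO2 is what remains after subtracting the free part.
so2$bound.sulfur.dioxide <- so2$total.sulfur.dioxide - so2$free.sulfur.dioxide

# Hypothetical derived column: share of total SO2 that is free.
so2$free.ratio <- so2$free.sulfur.dioxide / so2$total.sulfur.dioxide

so2
```

With the free level held fixed, a lower total gives a higher free share and less bound sulfur dioxide.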
The data set contains 4,898 white wines with 11 variables that quantify the chemical properties of each wine. At the beginning of this analysis, the distributions of all 11 variables were plotted, and almost all of them show an approximately normal distribution after removing outliers. Only residual sugar shows a heavily right skewed distribution.
Alcohol has the highest correlation with quality (0.44). We then explored the relationships between different chemical properties. Alcohol is highly negatively correlated with density (-0.78), and density is positively correlated with residual sugar (0.84) and total sulfur dioxide (0.53). When we looked further into total sulfur dioxide, we found that white wines with a higher ratio of free to total sulfur dioxide usually have higher quality.
Among all the graphs we plotted, we could distinguish between good and bad quality white wine using alcohol, chlorides and the free sulfur dioxide ratio. In addition, at certain levels of chlorides and residual sugar, we could approximately ascertain that the quality is at a normal level (around 5 and 6).
To conclude, with the current dataset, it is relatively easy to distinguish between good and bad quality when certain chemical properties are at an extreme level. But it is very difficult to distinguish normal quality white wine from good or bad quality white wine. A likely reason is that a bottle of white wine has so many chemical properties that a small difference in some of them may not noticeably affect the taste.
If the dataset also included the production year, the country of production or even the name of the chateau, and if more people tasted each white wine, there would be more information for determining the correlations, and it might be easier to distinguish between different qualities of white wine.